Abstract: Most sampling techniques for online social networks 
(OSNs) are based on a particular sampling method on a single 
graph, which is referred to as a statistic. However, various 
realizing methods on different graphs could possibly be used 
in the same OSN, and they may lead to different sampling 
efficiencies, i.e., asymptotic variances. To utilize multiple statistics 
for accurate measurements, we formulate a mixture sampling 
problem, through which we construct an unbiased mixture 
estimator that minimizes the asymptotic variance. Given fixed 
sampling budgets for different statistics, we derive the optimal 
weights to combine the individual estimators; given a fixed total 
budget, we show that a greedy allocation towards the most 
efficient statistic is optimal. In practice, the sampling efficiencies 
of statistics can be quite different for various targets and are 
unknown before sampling. To solve this problem, we design a 
two-stage framework which adaptively spends a partial budget to 
test different statistics and allocates the remaining budget to the 
inferred best statistic. We show that our two-stage framework 
is a generalization of 1) randomly choosing a statistic and 2) 
evenly allocating the total budget among all available statistics, 
and our adaptive algorithm achieves higher efficiency than these 
benchmark strategies in theory and experiment. 

I. Introduction 

With the ever increasing popularity of online social networks 
(OSNs) in recent years, many studies have focused on 
the analysis of OSNs, such as estimating various properties of 
the users and their relationships. OSNs are usually measured 
via graph sampling techniques, because they are typically too 
large to be completely visited and OSN service providers 
rarely make their complete network dataset publicly visible. To 
guarantee the estimation accuracy, many unbiased graph sampling 
methods have been designed, such as the simple random 
walk with re-weighting (RWRW), the frontier sampling 
(FS) and the random walk with uniform restarts (RWuR). 
However, OSNs often consist of multiple social graphs 
which can be sampled by different unbiased graph sampling 
methods. For example, in the YouTube social network, users 
are allowed to declare friendship with each other and create 
interest groups for others to join in. This creates two graphs 
whose edge sets correspond to 1) the mutual friendship and 2) 
the sharing of membership of some interest group among the 
users, respectively. For a given measurement target, sampling 
via different graphs usually has different efficiencies, which 
also vary as the measurement target changes. Furthermore, 
various graph sampling methods can be applied to the same 
social graph, e.g., the FS and the RWuR are both realizable 
in the friendship graph of LiveJournal. However, they 
might induce different sampling efficiencies, which are often 
unknown a priori. Although one can use multiple unbiased 
statistics, generated by different methods on different graphs, 
to form a heterogeneous statistic, it is unclear how one could 
1) optimally allocate the sampling budgets among different 
statistics and 2) optimally combine them. 

As we focus on unbiased estimators, we use the asymptotic 
variance $\sigma^2$ to measure the efficiency of a statistic (or its 
estimator). We formulate a mixture sampling problem that 
tries to minimize the asymptotic variance of a linearly mixed 
estimator, constrained by sampling budgets. Given allocated 
budgets for different statistics, we prove that the optimal 
weights of individual estimators are inversely proportional to 
their asymptotic variances; under a fixed total budget, we 
rank the allocation decisions and find that a greedy allocation 
is optimal, i.e., allocating more budgets to the statistic with 
smaller asymptotic variance is always better. 

However, the asymptotic variances of the statistics are 
usually unknown before sampling. To address this challenge, 
we design a two-stage framework with a pilot and a regular 
sampling stage. In the pilot sampling stage, we allocate part 
of the sampling budget to all the statistics and infer the most 
efficient statistic by estimating the asymptotic variance of 
each statistic. In the regular sampling stage, we allocate the 
remaining budget to the inferred most efficient statistic. Our 
framework is a generalization of two benchmark strategies: 
1) spending all budget on a randomly chosen statistic and 2) 
allocating the budget among all available statistics evenly. We 
show that our two-stage strategies achieve higher sampling 
efficiency than the two benchmark strategies. Furthermore, to 
allocate an optimal sub-budget for the pilot sampling stage, 
we design an online algorithm to dynamically estimate an 
upper-bound of the optimal fraction during the pilot sampling. 
Because the inference of the most efficient statistic is made by 
estimating the asymptotic variances in the pilot sampling stage, 
it makes our framework adaptive for different measurement 
targets. Our framework does not restrict how the estimators 
of asymptotic variances should be constructed, as long as 
they are asymptotically unbiased. To illustrate, we provide a 
detailed implementation and evaluate the performance of our 
framework in the Douban social network. The experimental 
results show that our technique uses only 18% to 57% of 
the sampling budget needed by the benchmark strategies 
for achieving the same estimation accuracy for a range of 
measurement targets. Our main contributions are as follows. 

• We formulate and solve a mixture sampling problem 
which constructs an optimal estimator of a heterogeneous 
statistic to improve sampling efficiency. In particular, we 
derive the optimal weights of the individual estimators 
in the mixture estimator (Theorem 1) and the optimal 
allocation decisions among the statistics (Theorem 2). 

• We design a two-stage framework and an adaptive algorithm 
(Algorithm 1) for the pilot sampling, a practical 
solution for the mixture sampling problem when the 
efficiencies of the statistics are unknown before sampling. 

• We show that the two-stage strategies are asymptotically 
optimal (Theorem 4) and achieve higher efficiency than 
two benchmark strategies (Corollary 1). 

• As a case study, we provide a detailed implementation 
of our framework and evaluate its performance in the 
Douban social network. 


The remainder of this paper is organized as follows. Section 
II introduces the concepts and characteristics of unbiased 
graph sampling methods. Section III defines the mixture sampling 
problem and presents its optimal solution. With unknown 
efficiencies of the statistics before sampling, we design the 
two-stage framework and its adaptive algorithm in Section 
IV. Section V implements the framework and evaluates its 
performance in the Douban social network. Section VI reviews 
related work and Section VII concludes. 


II. Unbiased Graph Sampling 

We denote an undirected graph in an online social network 
as $G = (V, \mathcal{E})$ with a set of nodes $V = \{1, \cdots, U\}$ to represent 
users and a set of edges $\mathcal{E}$ to represent the relationships among 
the users. We denote $f$ as a property and $f_v$ as its value for 
user $v$. Our measurement target is to estimate the mean value 
of property $f$ over all users in $V$, i.e., $\bar{f} = \frac{1}{U} \sum_{v \in V} f_v$. 

We consider a graph sampling method that traverses the 
nodes of the graph via a random walk, which generates a 
discrete-time stochastic process $\{X_t\}_{t \in \mathbb{N}}$ with the state space 
of $V$, i.e., $X_t \in V$ for all $t \in \mathbb{N}$. We define the random variable 
$\hat{f}(m)$ as an estimator on the sample path $\{X_t : t = 1, \cdots, m\}$ 
of $m$ samples. An estimator $\hat{f}(\cdot)$ is unbiased if $\mathbb{E}[\hat{f}(m)] = \bar{f}$ 
for all $m \in \mathbb{N}$ and is asymptotically unbiased if 

$$\hat{f}(m) \xrightarrow{a.s.} \bar{f} \quad \text{as } m \to \infty,$$

where $\xrightarrow{a.s.}$ denotes convergence almost surely. If the process 
$\{X_t\}_{t \in \mathbb{N}}$ is ergodic, by the central limit theorem (CLT), 

$$\sqrt{m}\,[\hat{f}(m) - \bar{f}] \xrightarrow{d} N(0, \sigma^2(f)) \quad \text{as } m \to \infty, \tag{1}$$

where $\xrightarrow{d}$ denotes convergence in distribution and $N(0, \sigma^2(f))$ 
denotes a normal distribution with mean 0 and variance $\sigma^2(f)$, 
which is defined by 

$$\sigma^2(f) = \lim_{m \to \infty} m \, \mathrm{Var}(\hat{f}(m)). \tag{2}$$

By (1), we can infer that $\hat{f}(m) \to \bar{f}$ as $m \to \infty$, i.e., $\hat{f}(m)$ is 
an asymptotically unbiased estimator of $\bar{f}$. It also shows that 
the distribution of $\sqrt{m}\,[\hat{f}(m) - \bar{f}]$ is asymptotically normal with 
variance $\sigma^2(f)$, which approximately determines how many 
samples are required to achieve a certain level of accuracy 
for the estimator $\hat{f}(m)$. Thus, we use the asymptotic variance 
$\sigma^2(f)$ to measure the efficiency of an asymptotically unbiased 
graph sampling method (or its estimator) in this paper. 
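
As a rough illustration of how $\sigma^2(f)$ translates into a sample size: under the normal approximation in (1), a two-sided 95% confidence interval for $\bar{f}$ has half-width of about $1.96\,\sigma(f)/\sqrt{m}$. The Python sketch below simply inverts this relation. The function name `required_samples`, the target half-width `eps`, and the example variances are illustrative assumptions and are not taken from the paper.

```python
import math

def required_samples(sigma2, eps, z=1.96):
    """Approximate number of samples m such that z * sqrt(sigma2 / m) <= eps.

    sigma2: asymptotic variance sigma^2(f) of the estimator (assumed known here).
    eps:    desired half-width of the confidence interval (hypothetical target).
    z:      normal quantile; 1.96 corresponds to a two-sided 95% interval.
    """
    return math.ceil(z * z * sigma2 / (eps * eps))

# A statistic with a quarter of the asymptotic variance needs
# roughly a quarter of the samples for the same accuracy.
m_slow = required_samples(4.0, 0.1)
m_fast = required_samples(1.0, 0.1)
```

This is why the asymptotic variance is a natural efficiency metric: the required budget scales linearly with it.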

In the next two sections, we formulate and solve a mixture 
sampling problem, based on which we design a two-stage 
framework to sample via multiple statistics. The estimators of 
these statistics can be based on very different asymptotically 
unbiased sampling methods on different graphs. 

III. Mixture Sampling Problem 

We consider an objective of measuring the mean value of 
property $f$ over the users, i.e., $\bar{f}$ defined earlier. We refer to 
an asymptotically unbiased sampling method on a social graph 
as a statistic, and assume there are $K$ types of statistics that 
can be applied in the OSN. For any statistic $k$, we denote the 
random variable $\hat{f}_k(m_k)$ as the value of its estimator given $m_k$ 
samples and $\sigma_k^2(f)$ as its asymptotic variance. We simplify the 
notation $\sigma_k^2(f)$ as $\sigma_k^2$ when we focus on a single property $f$. 
Because each estimator $\hat{f}_k(m_k)$ is asymptotically unbiased, 
we use the asymptotic variance $\sigma_k^2$ as a metric for comparing 
the efficiencies of these statistics. If the asymptotic variance 
$\sigma_i^2$ is smaller than $\sigma_j^2$, we say statistic $i$ is more efficient than 
statistic $j$ for estimating $\bar{f}$. Furthermore, we denote $k^*$ as the 
most efficient statistic, i.e., $\sigma_{k^*}^2 = \min\{\sigma_k^2 : k = 1, \cdots, K\}$. 

A. Mixture Sampling Problem 

Suppose we have a total sampling budget* of $M$ samples 
and $K$ types of candidate statistics. We consider the mixture 
sampling problem of how to allocate the sampling budget 
among different statistics and how to construct an unbiased 
estimator $\hat{f}$ for $\bar{f}$ so as to minimize its asymptotic variance. 

We denote $a = (a_1, \cdots, a_K)$ as a budget allocation 
decision, where each $a_k \ge 0$ defines the fraction of the total 
budget allocated to statistic $k$. We define $\mathcal{K}_a = \{k : a_k > 0\}$ 
to be the set of active statistics. Thus, each active statistic $k$ 
has a budget $m_k = a_k M$ and an estimator $\hat{f}_k(m_k)$. Because 
the sum of the budgets allocated to the statistics cannot exceed 
the total budget, we define the constraint set of the allocation 
decisions as $\mathcal{A} = \{a \mid \sum_{k=1}^{K} a_k \le 1;\ a_k \ge 0\ \forall k = 1, \cdots, K\}$. 
Given a vector $\hat{\mathbf{f}} = (\hat{f}_1, \cdots, \hat{f}_K)$ of estimators, we consider a 
mixed estimator $\hat{f}(w)$ which linearly combines the individual 
estimators by a weight vector $w = (w_1, \cdots, w_K)$, defined as 

$$\hat{f}(w) = \sum_{k=1}^{K} w_k \hat{f}_k. \tag{3}$$

Each weight $w_k$ is used to determine the relative importance 
of the individual estimator $\hat{f}_k$. Under a total budget $M$ and 
an allocation decision $a$, we define the mixture estimator with 
weights $w$ as 

$$\hat{f}(a, M, w) \triangleq \sum_{k \in \mathcal{K}_a} w_k \cdot \hat{f}_k(m_k) = \sum_{k \in \mathcal{K}_a} w_k \cdot \hat{f}_k(a_k M). \tag{4}$$

We define the asymptotic variance of the above estimator as 

$$\varsigma(a, w) = \lim_{M \to \infty} M \cdot \mathrm{Var}(\hat{f}(a, M, w)). \tag{5}$$

*We assume that one unit of the budget is the cost of visiting a node. 





If each $\hat{f}_k$ is asymptotically unbiased, we hope that the 
constructed mixture estimator $\hat{f}(a, M, w)$ would still be 
asymptotically unbiased. We denote the set $\mathcal{W}_a$ to be the 
domain of weights under the budget allocation $a$ such that 
for every $w \in \mathcal{W}_a$, $\hat{f}(a, M, w)$ is asymptotically unbiased. 

Our design goal is to construct the optimal unbiased estimator 
$\hat{f}(a, M, w)$ whose asymptotic variance $\varsigma(a, w)$ is 
minimized. We formulate two related mixture sampling 
problems as follows. In the first problem, we consider a given 
allocation decision $a$ and we denote $\varsigma_a(w) = \varsigma(a, w)$. The 
objective is to find the optimal weights $w^*$ that solve: 

$$\text{Minimize } \varsigma_a(w) \quad \text{subject to } w \in \mathcal{W}_a. \tag{6}$$

In the second problem, the objective is to find the optimal 
allocation decision $a^*$ and the corresponding optimal weights 
$w^*(a^*)$ that solve: 

$$\text{Minimize } \varsigma(a, w) \quad \text{subject to } a \in \mathcal{A} \text{ and } w \in \mathcal{W}_a. \tag{7}$$

The first problem can be regarded as a sub-problem of the 
second one, where the allocation decision is predetermined. 


B. Optimal Weights and Allocation Decisions 

In this subsection, we solve for the optimal weights to construct 
an estimator and the optimal budget allocation for maximizing 
the efficiency of an estimator. Under a fixed budget allocation 
decision $a$, intuitively, a larger weight $w_k$ should be given to 
an estimator $\hat{f}_k$ if statistic $k$ is more efficient, i.e., its asymptotic 
variance $\sigma_k^2$ is smaller. The following result confirms 
this intuition. 


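Theorem 1. Assume all the pure estimators $\hat{f}_k$ are independent 
of each other. The mixture estimator $\hat{f}(a, M, w)$ is 
asymptotically unbiased for $\bar{f}$ if and only if the domain of 
weights under an allocation decision $a$ satisfies 

$$\mathcal{W}_a = \Big\{ w \,\Big|\, \sum_{k \in \mathcal{K}_a} w_k = 1 \Big\}. \tag{8}$$

Its asymptotic variance can be characterized by a function of 
the allocation $a$ and the weight vector $w$, defined as 

$$\varsigma(a, w) = \sum_{k \in \mathcal{K}_a} \frac{w_k^2 \sigma_k^2}{a_k}. \tag{9}$$

The optimal solution $w^*$ of the optimization problem in 
Equation (6) satisfies 

$$w_k^* = \frac{a_k}{\sigma_k^2} \Big/ \sum_{i \in \mathcal{K}_a} \frac{a_i}{\sigma_i^2}, \quad \forall k \in \mathcal{K}_a, \tag{10}$$

and the corresponding minimum asymptotic variance is 

$$\varsigma_a(w^*) = 1 \Big/ \sum_{k \in \mathcal{K}_a} \frac{a_k}{\sigma_k^2}.$$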


Theorem 1 shows that to guarantee the mixture estimator to 
be asymptotically unbiased, the sum of weights of the active 
statistics must be one. It also shows that when the allocation 
decision $a$ is fixed, the optimal weight $w_k^*$ of each estimator 
$\hat{f}_k(m_k)$ is proportional to $a_k$ and inversely proportional to 
its asymptotic variance $\sigma_k^2$. Based on Theorem 1, we denote 
$w^*(a)$ to be the optimal solution of (6) defined in (10), and 
the second optimization problem (7) could be stated as finding 
the optimal allocation $a^*$ that solves: 

$$\text{Minimize } \varsigma(a, w^*(a)) \quad \text{subject to } a \in \mathcal{A}. \tag{11}$$

Intuitively, an optimal solution should allocate more budget 
to the more efficient statistic. The next result shows that a 
greedy strategy that allocates the entire budget to the statistic with 
the smallest asymptotic variance is actually optimal. 

Theorem 2. Assume that the conditions of Theorem 1 hold. 
Denote $\{\sigma_{(k)}^2\}_{k=1}^{K}$ as the relabeled set of asymptotic variances 
$\{\sigma_k^2\}_{k=1}^{K}$ in ascending order. For any allocation 
decisions $a$ and $\tilde{a}$ satisfying $\sum_{k=1}^{i} a_{(k)} \ge \sum_{k=1}^{i} \tilde{a}_{(k)}$ for all 
$i = 1, 2, \cdots, K$, we have 

$$\varsigma(a, w^*(a)) \le \varsigma(\tilde{a}, w^*(\tilde{a})).$$

In particular, the optimal allocation $a^*$, which solves the 
optimization problem in Equation (11), satisfies $a_k^* = \mathbb{1}_{\{k = k^*\}}$ 
with the minimum asymptotic variance 

$$\varsigma(a^*, w^*(a^*)) = \sigma_{k^*}^2.$$

Theorem 2 states that an allocation decision $a$ is more 
efficient, i.e., it induces a smaller $\varsigma(a, w^*(a))$, if it allocates 
more budget to more efficient statistics. In particular, if we 
greedily allocate the entire budget to the most efficient statistic $k^*$, 
the asymptotic variance $\varsigma(a, w^*(a))$ will be minimized. 

Theorems 1 and 2 show that the optimal solutions are 
closely related to the asymptotic variances of the individual 
statistics, and the directions for decreasing $\varsigma(a, w)$ 
are allocating as much budget to statistic $k^*$ as possible and 
weighting the individual estimators inversely proportional to 
their asymptotic variances. However, the asymptotic variances 
$\sigma_k^2$ are usually unknown before sampling. To address this 
challenge, we propose a two-stage framework, where we infer 
the best statistic $k^*$ in the first stage before allocating all the 
remaining budget greedily in the second stage. 
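
To make Theorems 1 and 2 concrete, the following Python sketch evaluates the asymptotic variance $\sum_{k} w_k^2 \sigma_k^2 / a_k$ of a mixture estimator, the optimal weights proportional to $a_k / \sigma_k^2$, and the greedy allocation. The function names and the three example variances are hypothetical; in practice the $\sigma_k^2$ are unknown, which is what the two-stage framework addresses.

```python
def mixture_variance(a, w, sigma2):
    """Asymptotic variance of the mixture: sum over active k of w_k^2 * sigma_k^2 / a_k."""
    return sum(w[k] ** 2 * sigma2[k] / a[k] for k in range(len(a)) if a[k] > 0)

def optimal_weights(a, sigma2):
    """Optimal weights: w_k proportional to a_k / sigma_k^2 on the active statistics."""
    raw = [a[k] / sigma2[k] if a[k] > 0 else 0.0 for k in range(len(a))]
    total = sum(raw)
    return [r / total for r in raw]

def greedy_allocation(sigma2):
    """Allocate the whole budget to the statistic with the smallest variance."""
    k_star = min(range(len(sigma2)), key=lambda k: sigma2[k])
    return [1.0 if k == k_star else 0.0 for k in range(len(sigma2))]

# Hypothetical asymptotic variances of K = 3 statistics.
sigma2 = [4.0, 1.0, 9.0]

# Even allocation with equal weights (w = a) versus optimal weights.
a_even = [1 / 3, 1 / 3, 1 / 3]
var_even_equal = mixture_variance(a_even, a_even, sigma2)
var_even_opt = mixture_variance(a_even, optimal_weights(a_even, sigma2), sigma2)

# Greedy allocation towards the most efficient statistic.
a_greedy = greedy_allocation(sigma2)
var_greedy = mixture_variance(a_greedy, optimal_weights(a_greedy, sigma2), sigma2)
```

In this toy setting the greedy allocation attains the smallest variance $\sigma_{k^*}^2 = 1$, and optimal weighting improves on equal weighting for the even allocation.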

IV. Adaptive Two-Stage Framework 

In this section, we first explain the basic concepts of a two-stage 
framework and then show that the framework achieves higher 
sampling efficiency than two benchmark strategies. Finally, we 
propose an adaptive algorithm to determine an upper-bound 
of the optimal budget fraction which is allocated to the first 
stage. 

A. Two Benchmark Strategies and A Two-Stage Generalization 

Without knowing the asymptotic variances $\sigma_k^2$ of the individual 
statistics, we start with two naive strategies as benchmarks. 
The first strategy spends the whole budget $M$ on a randomly chosen 
statistic $k$; the second strategy evenly divides the budget $M$ 
among $K$ statistics to construct the mixture estimator. We call 
these two benchmark strategies the Random Statistics (or 
RND) and Average Statistics (or AVG), respectively. 

Based on the two benchmark strategies, we consider a two-stage 
generalization, which spends a partial budget to estimate 
the best statistic $k^*$ in a pilot sampling stage and allocates the 
remaining budget to an estimated best statistic $\hat{k}^*$ in a regular 
sampling stage. We assume that a fraction $c \in [0,1]$ of the 
total budget $M$ is allocated for pilot sampling and name the 
$cM$ samples as the pilot budget. We evenly allocate the pilot 
budget among all $K$ statistics, and therefore, each statistic 
$k$ is allocated a budget of $m_k = cM/K$ samples in this 
stage. We use these pilot samples to make an asymptotically 
unbiased estimate of each asymptotic variance $\sigma_k^2$, and denote 
the estimated value by $\hat{\sigma}_k^2(m_k)$. The statistic with 
the smallest estimated asymptotic variance is likely to be the most 
efficient statistic $k^*$ for estimating $\bar{f}$. We call this statistic 
the inferred most efficient statistic and denote it as $\hat{k}^*(cM)$, 
parameterized by the pilot sampling budget $cM$. In the regular 
sampling stage, we allocate all the remaining sampling budget 
$(1-c)M$ to the inferred most efficient statistic $\hat{k}^*$, and fully 
use the total budget $M$ to construct a mixture estimator. 

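Under the above two-stage framework, we denote $a(c)$ as 
the effective allocation decision, defined by 

$$a_k(c) \triangleq c/K + (1-c)\,\mathbb{1}_{\{k = \hat{k}^*(cM)\}}, \quad \forall k = 1, \cdots, K, \tag{12}$$

through which we can define the effective budget for each 
statistic $k$ as $m_k(c) = a_k(c) M$ naturally. After both sampling 
stages, we construct a mixture estimator by using an estimated 
optimal weight vector $\hat{w}^*(c)$. We use the estimated value 
$\hat{\sigma}_k^2(m_k)$ to approximate $\sigma_k^2$, and define $\hat{w}^*(c)$ by substituting 
$\sigma_k^2$ with $\hat{\sigma}_k^2(m_k)$ in the optimal weight of Equation (10) as 

$$\hat{w}_k^*(c) = \frac{a_k(c)}{\hat{\sigma}_k^2(m_k(c))} \Big/ \sum_{i \in \mathcal{K}_{a(c)}} \frac{a_i(c)}{\hat{\sigma}_i^2(m_i(c))}, \quad \forall k \in \mathcal{K}_{a(c)}. \tag{13}$$

Consequently, the corresponding mixture estimator and its 
asymptotic variance can be written as $\hat{f}(a(c), M, \hat{w}^*(c))$ and 
$\varsigma(a(c), \hat{w}^*(c))$, respectively. 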

The two-stage framework actually uses the AVG and RND 
strategies in its pilot and regular sampling stages, respectively. 
In particular, the estimated statistic $\hat{k}^*$ plays the role of a 
random statistic in the RND strategy. Also, the framework can 
be seen as a generalization of the two benchmark strategies, 
because the Average and Random Statistics are equivalent to 
a two-stage strategy of $c = 1$ and $c = 0$, respectively. 


Theorem 3. The asymptotic variances of the Random 
Statistics and Average Statistics are $\varsigma(a(0), \hat{w}^*(0))$ and 
$\varsigma(a(1), a(1))$, respectively. They satisfy 

$$\mathbb{E}[\varsigma(a(0), \hat{w}^*(0))] = \varsigma(a(1), a(1)) = \frac{1}{K} \sum_{k=1}^{K} \sigma_k^2.$$

Theorem 3 states that the expected asymptotic variance of 
the Random Statistics and the asymptotic variance of Average 
Statistics both equal the average of the asymptotic variances 
of all individual statistics. 
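
The identity in Theorem 3 can be checked numerically. The sketch below builds the effective allocation $a_k(c) = c/K + (1-c)\,\mathbb{1}\{k = \hat{k}^*\}$ for $c = 1$ (AVG) and $c = 0$ (RND) and evaluates the mixture variance $\sum_k w_k^2 \sigma_k^2 / a_k$ with $w = a$. The function names and example variances are hypothetical, and for RND the randomly chosen statistic is averaged over all $K$ choices to obtain the expectation.

```python
def effective_allocation(c, K, k_hat):
    """Every statistic gets c/K in the pilot stage; the inferred best one gets 1 - c more."""
    return [c / K + (1.0 - c) * (1.0 if k == k_hat else 0.0) for k in range(K)]

def mixture_variance(a, w, sigma2):
    """Asymptotic variance of the mixture: sum over active k of w_k^2 * sigma_k^2 / a_k."""
    return sum(w[k] ** 2 * sigma2[k] / a[k] for k in range(len(a)) if a[k] > 0)

sigma2 = [4.0, 1.0, 9.0]          # hypothetical asymptotic variances
K = len(sigma2)
mean_variance = sum(sigma2) / K

# AVG corresponds to c = 1: the allocation is uniform and k_hat is irrelevant.
a_avg = effective_allocation(1.0, K, k_hat=0)
var_avg = mixture_variance(a_avg, a_avg, sigma2)

# RND corresponds to c = 0: all budget goes to one uniformly random statistic,
# so the expectation is the average over the K possible choices.
var_rnd_expected = sum(
    mixture_variance(effective_allocation(0.0, K, k),
                     effective_allocation(0.0, K, k), sigma2)
    for k in range(K)
) / K
```

Both quantities coincide with the plain average of the $\sigma_k^2$, which is the baseline the two-stage strategies aim to beat.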


B. Asymptotic Performance of Two-Stage Strategies 

Our two-stage framework does not restrict how the asymptotic 
variances are estimated in the pilot sampling stage. We 
will show that as long as $\hat{\sigma}_k^2(\cdot)$ is an asymptotically unbiased 
estimator for $\sigma_k^2$, the two-stage strategies will outperform the 
two benchmark strategies. The detailed design of the estimator 
$\hat{\sigma}_k^2(\cdot)$ may depend on the sampling method of statistic $k$, and 
we will give an example of implementation in a later section. 


Given any strategy $c \in [0,1]$, we can define the (unknown) 
optimal allocation decision $a^*(c) = (a_1^*(c), \ldots, a_K^*(c))$ as 

$$a_k^*(c) \triangleq c/K + (1-c)\,\mathbb{1}_{\{k = k^*\}}, \quad \forall k = 1, \ldots, K.$$

Under this optimal allocation $a^*(c)$, by Theorem 1, the 
corresponding optimal weight vector becomes $w^*(a^*(c))$. 
Intuitively, when a budget $cM$ is used to estimate each $\sigma_k^2$ 
in the pilot sampling stage, $m_k = cM/K$ for any statistic $k$ 
and the best statistic $k^*$ is more likely to induce a smaller estimated 
asymptotic variance $\hat{\sigma}_{k^*}^2(m_{k^*})$ than other statistics. Consequently, 
the resulting allocation $a(c)$ and weights $\hat{w}^*(c)$ are 
more likely to be equal to the optimal $a^*(c)$ and $w^*(a^*(c))$, 
respectively. We consider the two-stage strategy $c$ as a function 
of the total budget $M$, denoted as $c(M)$, and simplify the 
notation $a^*(c(M))$ as $a^*(M)$. The next theorem shows that 
when the pilot budget fraction $c$ is higher than the order of 
$M^{-1}$, the two-stage strategy $c(M)$ is asymptotically optimal. 


Theorem 4. Assume each estimated asymptotic variance 
$\hat{\sigma}_k^2(m_k)$ is asymptotically unbiased for $\sigma_k^2$ ($k = 1, \cdots, K$), i.e., 
$\hat{\sigma}_k^2(m_k) \to \sigma_k^2$ as $m_k \to +\infty$. 

If $c(M) \in \omega(M^{-1})$, i.e., for all $\delta > 0$, there exists a positive 
number $M'$ such that $c(M) > \delta M^{-1}$ for all $M > M'$, then 
$\hat{k}^* \to k^*$, $a(c(M)) \to a^*(M)$ and 
$\hat{w}^*(c(M)) \to w^*(a^*(M))$ as $M \to +\infty$. 

Theorem 4 shows that as the total budget $M$ grows, to guarantee 
an (asymptotically) optimal two-stage strategy, the fraction 
$c$ for the pilot budget does not need to be large. The condition 
$c(M) \in \omega(M^{-1})$ ensures that the pilot budget $c(M)M$ grows 
with $M$ unboundedly as $M$ goes to infinity, although $c$ itself 
could approach zero, such that the estimated asymptotic 
variance $\hat{\sigma}_k^2(m_k)$ will converge to $\sigma_k^2$. Consequently, the two-stage 
strategy $c(M)$ will identify the most efficient statistic $k^*$ 
via the pilot sampling and set the optimal allocation $a^*(c(M))$ 
and optimal weight $w^*(a^*(c(M)))$ for the mixture estimator. 

When we simply give the same weight to each sample 
point, for any allocation $a$, the corresponding weight vector 
becomes $w = a$, whose entries are proportional to the sample sizes. 
To distinguish the benefit of choosing an optimal allocation 
$a^*$ and an optimal weight $w^*$, we consider an intermediate 
mixture estimator $\hat{f}(a, M, a)$, which is affected only by the 
allocation decision $a$ and has an asymptotic variance $\varsigma(a, a)$. 


Corollary 1. Under the conditions of Theorem 4, for any pilot 
fraction $c(M) \in \omega(M^{-1})$, as $M \to +\infty$, we have 

$$\varsigma\big(a(c(M)), \hat{w}^*(c(M))\big) \to \varsigma\big(a^*(M), w^*(a^*(M))\big),$$

and the asymptotic limit of $\varsigma$ satisfies 

$$\varsigma\big(a^*(M), w^*(a^*(M))\big) \le \varsigma\big(a^*(M), a^*(M)\big) \le \frac{1}{K} \sum_{k=1}^{K} \sigma_k^2$$

and 

$$\varsigma\big(a^*(M), w^*(a^*(M))\big) \le \frac{K \sigma_{k^*}^2}{K + (1-K)\,c(M)}.$$

As a consequence of Theorem 4, Corollary 1 shows that as 
$M$ grows, the asymptotic variance $\varsigma$ induced by the strategy 
$c(M)$ converges to an optimal value $\varsigma(a^*(M), w^*(a^*(M)))$. 
The first inequality implies that 1) using the estimated optimal 
weight $\hat{w}^*(c(M))$ is more efficient than the equal weight 
$w = a$, and 2) using $w = a$ is again more efficient than 
the two benchmark strategies, whose (expected) asymptotic 
variances equal $\frac{1}{K} \sum_{k=1}^{K} \sigma_k^2$, as shown in Theorem 3. The 
second inequality provides an upper-bound for the optimal 
$\varsigma$, which can be derived from an estimator $\hat{f}_{k^*}(a_{k^*}^*(c(M))\,M)$ 
which only uses the samples of the inferred best statistic $k^*$ 
and throws out the samples of other statistics collected in the 
pilot sampling stage. 
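
The inequalities of Corollary 1 can be illustrated with a small Python sketch. It assumes the variance formula $\varsigma(a, w) = \sum_k w_k^2 \sigma_k^2 / a_k$, so that $\varsigma(a, a) = \sum_k a_k \sigma_k^2$ and $\varsigma(a, w^*(a)) = 1 / \sum_k (a_k / \sigma_k^2)$. The function name, the example variances and the grid of pilot fractions are hypothetical.

```python
def optimal_two_stage_bounds(sigma2, c):
    """Return the four quantities compared in Corollary 1 for a pilot fraction c."""
    K = len(sigma2)
    k_star = min(range(K), key=lambda k: sigma2[k])
    # Optimal allocation a*(c): c/K to everyone, the remaining 1 - c to k*.
    a_star = [c / K + (1.0 - c) * (1.0 if k == k_star else 0.0) for k in range(K)]
    active = [k for k in range(K) if a_star[k] > 0]
    var_opt_weights = 1.0 / sum(a_star[k] / sigma2[k] for k in active)
    var_equal_weights = sum(a_star[k] * sigma2[k] for k in active)  # w = a
    var_benchmark = sum(sigma2) / K                                  # RND / AVG
    var_best_only = K * sigma2[k_star] / (K + (1 - K) * c)           # k* samples only
    return var_opt_weights, var_equal_weights, var_benchmark, var_best_only

sigma2 = [4.0, 1.0, 9.0]   # hypothetical asymptotic variances
results = {c: optimal_two_stage_bounds(sigma2, c) for c in (0.0, 0.1, 0.5, 1.0)}
```

For each tested fraction the optimally weighted variance is no larger than the equally weighted one, which in turn is no larger than the benchmark average; the gap widens as the pilot fraction shrinks.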

C. Optimal Fraction for Pilot Budget 

Our design of any two-stage strategy $c(M) \in \omega(M^{-1})$ is 
asymptotically optimal. However, a more practical problem is, 
given a finite budget $M$, how to choose an optimal fraction 
$c^*(M)$ for the pilot budget that maximizes the efficiency 
of the mixture estimator $\hat{f}$, i.e., $c^*(M)$ solves: 

$$\text{Minimize } \mathrm{Var}\big(\hat{f}(a(c), M, \hat{w}^*(c))\big) \quad \text{subject to } c \in [0,1]. \tag{14}$$

On the one hand, when allocating more budget for the pilot 
sampling, each $\hat{\sigma}_k^2(m_k)$ could provide a more accurate estimation 
for the asymptotic variance $\sigma_k^2$ and the best statistic $k^*$ would 
have a higher chance to be picked out in the regular sampling 
stage. On the other hand, increasing the pilot budget means 
that more budget will be allocated to some inefficient statistics 
at the pilot sampling stage. One needs to balance the above 
contradictory conditions so as to obtain an optimal fraction 
$c^*(M)$. In practice, it is hard to obtain the exact value of 
the optimal fraction for the pilot budget $c^*(M)$, because it 
depends on the unknown values of asymptotic variances $\sigma_k^2$. 
However, we will provide a heuristic algorithm to estimate 
$c^*(M)$ effectively, which is based on the following theoretical 
result on the monotonicity of $c^*(M)$. 

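Theorem 5. Assume the rate of convergence of the estimated 
asymptotic variance $\hat{\sigma}_k^2(m)$ for each $k$ is $O(m^{-\eta_k})$, 
i.e., $\sup_{x \in \mathbb{R}^+} |G_{\hat{\sigma}_k^2(m)}(x) - G_{\sigma_k^2}(x)| = O(m^{-\eta_k})$, where 
$G_{\hat{\sigma}_k^2(m)}(x) = P(\hat{\sigma}_k^2(m) \le x)$ and $G_{\sigma_k^2}(x) = \mathbb{1}_{\{x \ge \sigma_k^2\}}$ are the 
cumulative distribution functions of $\hat{\sigma}_k^2(m)$ and $\sigma_k^2$, respectively, 
and the order $\eta_k > 0$. Let $\eta = \min\{\eta_k : k = 1, \cdots, K\}$. The 
optimal fraction satisfies $\lim_{M \to +\infty} c^*(M) = 0$, with a polynomial 
rate of convergence in $M$ whose exponent is determined by $\eta$. 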

Theorem 5 shows that the optimal fraction $c^*(M)$ decreases 
to zero asymptotically, at a polynomial rate in $M$ determined by $\eta$, when $M$ 
grows. Intuitively, as the total budget $M$ increases, to guarantee 
the same accuracy for estimating $k^*$, we only need to keep 
the pilot budget $cM$ constant and thus the fraction $c$ becomes 
smaller. Both Theorem 4 and Theorem 5 imply that when $M$ becomes 
larger, the optimal fraction $c^*(M)$ should decrease. Therefore, 
we assume that $c^*(M)$ follows a decreasing trend as $M$ 
increases (our evaluations in the Douban social network in 
Section V also support this conjecture), based on which we 
propose an adaptive algorithm to dynamically determine the 
optimal fraction $c^*(M)$ for the pilot sampling. 


Algorithm 1 Adaptive Two-Stage Sampling ($M$, $\Delta m$) 

1: $c \leftarrow \Delta m / M$; 
2: spend $\Delta m$ budget for pilot sampling; 
3: while $c < \hat{c}^*(cM)$ do 
       $c \leftarrow c + \Delta m / M$; 
       spend $\Delta m$ more budget for pilot sampling; 
   end while 
4: choose the estimated best statistic $\hat{k}^*$; 
5: spend the remaining budget $(1-c)M$ for regular sampling; 


Algorithm 1 performs the pilot sampling in an adaptive 
manner. It takes two input parameters: the total budget $M$ and 
a budget spending stepsize $\Delta m \in (0, M)$. We denote $\hat{c}^*(\cdot)$ as a 
function where each $\hat{c}^*(m)$ provides an estimated upper-bound 
of the optimal fraction $c^*(M)$ when $m$ samples are 
used. In step 3, we increase the pilot budget by $\Delta m$ if the spent 
fraction $c$ is smaller than the derived upper-bound $\hat{c}^*(cM)$ for 
$c^*(M)$, until $c$ exceeds the upper-bound $\hat{c}^*(cM)$. Based on the 
$cM$ samples generated in the pilot sampling stage, we choose 
the estimated best statistic $\hat{k}^*$ and spend the remaining budget 
$(1-c)M$ for regular sampling as usual. In general, given any 
$m$ pilot samples, the function $\hat{c}^*(\cdot)$ uses them to estimate an 
optimal fraction $c^*(m')$ for some $m' < m$. Because $c^*(\cdot)$ has 
a decreasing trend in general, we could use this estimation 
of $c^*(m')$ as an upper-bound for $c^*(M)$ so as to determine 
whether the pilot sampling stage should end. As the sampling 
budget $cM$ increases, the estimation $\hat{c}^*(cM)$ should decrease 
and approach $c^*(M)$, because it estimates some $c^*(m')$ and 
$m'$ increases. Notice that our algorithm does not restrict how 
the upper-bound estimation $\hat{c}^*(\cdot)$ should be implemented, and 
we will provide an example of implementation which we use 
in our evaluation in a later section. Finally, although a large 
stepsize approaches $c^*(M)$ faster, to avoid overestimating the 
pilot budget, a small value of $\Delta m$ should be used in practice. 
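
The control flow of Algorithm 1 can be sketched in Python as follows. The four callbacks (`spend_pilot`, `upper_bound`, `choose_best`, `spend_regular`) are hypothetical placeholders for the crawler and the estimators, which the algorithm deliberately leaves open. The extra guard that stops the loop once the whole budget would be consumed is an assumption added here for safety and is not part of the listing above.

```python
def adaptive_two_stage(M, delta_m, spend_pilot, upper_bound, choose_best, spend_regular):
    """Sketch of Algorithm 1 with hypothetical callbacks.

    spend_pilot(b):      collect b more pilot samples (split over the statistics).
    upper_bound(m):      estimated upper bound of the optimal fraction from m pilot samples.
    choose_best():       index of the statistic with the smallest estimated variance.
    spend_regular(k, b): spend budget b on statistic k in the regular stage.
    """
    steps = 1                       # number of delta_m chunks spent so far
    spend_pilot(delta_m)
    # Step 3: keep sampling while c = steps * delta_m / M is below the bound.
    # The second condition (an added assumption) keeps the pilot within the budget.
    while (steps * delta_m / M < upper_bound(steps * delta_m)
           and (steps + 1) * delta_m <= M):
        steps += 1
        spend_pilot(delta_m)
    c = steps * delta_m / M
    k_hat = choose_best()
    spend_regular(k_hat, M - steps * delta_m)
    return c, k_hat

# Toy usage with a made-up decreasing bound; no real sampling is performed.
log = {"pilot": 0, "regular": 0}

def toy_spend_pilot(budget):
    log["pilot"] += budget

def toy_spend_regular(k, budget):
    log["regular"] += budget

c, k_hat = adaptive_two_stage(
    M=1000, delta_m=10,
    spend_pilot=toy_spend_pilot,
    upper_bound=lambda m: 50.0 / m,   # hypothetical bound, decreasing in m
    choose_best=lambda: 2,
    spend_regular=toy_spend_regular,
)
```

Counting whole steps instead of accumulating `c` avoids floating-point drift in the stopping test.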

V. Evaluation in Douban Social Network 

In this section, we apply the adaptive two-stage framework 
to the Douban social network, a popular Chinese web site 
providing user comment and recommendation services for 
books, music and movies. We first introduce multiple statistics 
which can be realized in Douban and then provide a detailed 
implementation of our framework to measure the statistics. 
Finally, we evaluate the performance of the framework. 

A. Multiple Statistics 

Similar to Twitter and Sina microblog, users in Douban can 
follow each other, and therefore, Douban can be seen as a 
followship graph† in which the edges capture the following 
relationship. Douban also allows users to create interest groups 
for others to join. We consider two users who have a 
common group to share a membership, so Douban can also be 
seen as a membership graph. These two different social graphs, 
together with two random walk based sampling methods, the 
RWuR and FS introduced next, provide four different available 
statistics. 

†Here, we treat the followship graph as an undirected graph, and one 
following relationship corresponds to an undirected edge. 


1) The random walk with uniform restarts (RWuR): The 
RWuR is a hybrid sampling method that mixes random 
walk crawling and uniform node sampling. It generates a 
sample set $\{X_t\}_{t \in \mathbb{N}}$ as follows. At each step $t$, assume the 
current node is $X_t = i$, and let $d_i$ denote the degree of node $i$. 
With probability $\alpha/(d_i + \alpha)$, it jumps 
to an arbitrary node $j$ of the graph chosen uniformly and 
makes the transition $X_{t+1} = j$. With probability $d_i/(d_i + \alpha)$, it 
uniformly chooses one of $i$'s neighboring nodes $k$, i.e., $X_{t+1} = k$. 
The parameter $\alpha$ ($\ge 0$) controls the probabilities of random 
walk and jump. In particular, when $\alpha = 0$, the RWuR is 
the simple random walk, and when $\alpha = +\infty$, the RWuR 
becomes the uniform node sampling. Obviously, the sample 
set $\{X_t\}_{t \in \mathbb{N}}$ is biased towards the high-degree nodes. To 
correct the bias, it uses the Hansen-Hurwitz estimator 
to re-weight the samples, i.e., the weight of the sample $X_t$ is 
inversely proportional to $d_{X_t} + \alpha$, and the unbiased estimator 
for $\bar{f}$ is 

$$\hat{f}(m) = \sum_{t=1}^{m} \frac{f_{X_t}}{d_{X_t} + \alpha} \Big/ \sum_{t=1}^{m} \frac{1}{d_{X_t} + \alpha}. \tag{15}$$
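
A minimal Python sketch of the RWuR sampler and its re-weighted estimator is given below. It assumes the graph is available in memory as an adjacency dictionary and that uniform node sampling is possible via `rng.choice` over all nodes; a real crawler would have to obtain both through the OSN interface. The function names and the toy graph are hypothetical.

```python
import random

def rwur_samples(adj, alpha, m, rng):
    """Generate m RWuR samples on an undirected graph given as {node: [neighbors]}."""
    nodes = list(adj)
    x = rng.choice(nodes)                      # arbitrary start node (assumption)
    samples = []
    for _ in range(m):
        samples.append(x)
        d = len(adj[x])
        if rng.random() < alpha / (d + alpha):
            x = rng.choice(nodes)              # uniform restart
        else:
            x = rng.choice(adj[x])             # simple random-walk step
    return samples

def rwur_estimate(samples, adj, f, alpha):
    """Re-weighted estimate of the mean of f: each sample weighs 1 / (degree + alpha)."""
    num = sum(f[x] / (len(adj[x]) + alpha) for x in samples)
    den = sum(1.0 / (len(adj[x]) + alpha) for x in samples)
    return num / den

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
degree = {v: len(nbrs) for v, nbrs in adj.items()}   # property f = node degree

rng = random.Random(0)
samples = rwur_samples(adj, alpha=1.0, m=20000, rng=rng)
estimated_mean_degree = rwur_estimate(samples, adj, degree, alpha=1.0)
```

The true mean degree of the toy graph is 2, and the re-weighted estimate should be close to it even though high-degree nodes are visited more often.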


2) The frontier sampling (FS): The FS is a distributed 
sampling method that runs $s$ ($\in \mathbb{N}$) random walkers on a 
graph. Initially, it uniformly obtains $s$ nodes as the start nodes 
of the $s$ random walkers. At each step, it first randomly selects 
the $r$-th walker with probability $d_{v_r} / \sum_{i=1}^{s} d_{v_i}$, where $v_i$ is the 
current node of the $i$-th walker. Then the $r$-th walker uniformly 
chooses one of $v_r$'s neighboring nodes as the next sample and moves 
to it. Similar to the RWuR, the bias towards high-degree nodes 
of the FS can be corrected by the Hansen-Hurwitz estimator, 
and the unbiased estimator for $\bar{f}$ is 

$$\hat{f}(m) = \sum_{t=1}^{m} \frac{f_{X_t}}{d_{X_t}} \Big/ \sum_{t=1}^{m} \frac{1}{d_{X_t}}. \tag{16}$$
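
The frontier sampling procedure can be sketched in Python in the same style. The sketch assumes an in-memory adjacency dictionary without isolated nodes (every walker must always have a neighbor to move to); the function names and the toy graph are hypothetical.

```python
import random

def frontier_samples(adj, s, m, rng):
    """Generate m frontier-sampling samples with s walkers on {node: [neighbors]}."""
    nodes = list(adj)
    walkers = [rng.choice(nodes) for _ in range(s)]   # uniform start nodes
    samples = []
    for _ in range(m):
        # Pick walker r with probability proportional to the degree of its node.
        degrees = [len(adj[v]) for v in walkers]
        r = rng.choices(range(s), weights=degrees, k=1)[0]
        # The chosen walker moves to a uniformly chosen neighbor, which is sampled.
        nxt = rng.choice(adj[walkers[r]])
        samples.append(nxt)
        walkers[r] = nxt
    return samples

def fs_estimate(samples, adj, f):
    """Re-weighted estimate of the mean of f: each sample weighs 1 / degree."""
    num = sum(f[x] / len(adj[x]) for x in samples)
    den = sum(1.0 / len(adj[x]) for x in samples)
    return num / den

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
degree = {v: len(nbrs) for v, nbrs in adj.items()}   # property f = node degree

rng = random.Random(0)
fs_path = frontier_samples(adj, s=3, m=20000, rng=rng)
fs_mean_degree = fs_estimate(fs_path, adj, degree)
```

As with the RWuR sketch, the estimate should be close to the true mean degree of 2 on this toy graph.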


The RWuR and FS samplers are less likely to get trapped 
in loosely connected components of a graph via jumping 
randomly and running multiple walkers, respectively. Thus, 
both of them usually perform better than the simple random 
walk with re-weighting, but we do not know which 
one achieves higher sampling efficiency in an unknown graph. 
Besides, it is unclear how the efficiencies of the two methods 
vary on the followship and membership graphs. Therefore, 
we choose the four statistics, the RWuR and FS on the 
followship and membership graphs, to demonstrate our two-stage 
framework. 


B. Implementation of Two-Stage Framework 

Our adaptive two-stage strategy does not restrict how the 
estimators of the asymptotic variances $\hat{\sigma}_k^2(\cdot)$ and the upper-bound 
estimation of the optimal fraction $\hat{c}^*(\cdot)$ are constructed, 
as long as they are asymptotically unbiased. Next, we provide 
an example of detailed implementations of $\hat{\sigma}_k^2(\cdot)$ and $\hat{c}^*(\cdot)$ for 
measuring $\bar{f}$ in Douban. 


1) Estimating the asymptotic variances: Both the pilot 
sampling and the adaptive Algorithm [T] need to estimate the 
unknown asymptotic variances cr^. Assume the sample set 
used to estimate cr^ is collected by q (> 2) samplers whose 
budgets are all 1. We denote the estimated value for / based 
on the j-th sampler as f\^\l) for j = 1, • • • , g, which serves 
as a sample of the estimator fk{l)- Then the sample variance 
of (0 • J = Ij ■ ■ ■ j 9 } is defined by 

2 

I ■■■ 


Si{q,l) = 


g 


q-l^ 

i=i 


^ i=l 


(17) 


It describes how far Vlfk \l) 0 = 1,2, •• • , g) are spread 
out. From the definition of asymptotic variance in Equation 
(|^, Sl{q, 1) is an asymptotically unbiased estimate of cr^, i.e., 
liTci lVar{fk(l)) = al as qj^oo. (18) 

l—fOO 


Thus, we can use $S_k^2(q, l)$ to estimate the asymptotic variance,
i.e., $\hat{\sigma}_k^2(ql) = S_k^2(q, l)$. Also, because all unbiased graph
sampling methods share the same definition of the asymptotic
variance, this implementation is applicable to any of them.
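
As a rough illustration of the estimator in (17), the following Python sketch computes the scaled sample variance from the estimates returned by $q$ independent samplers, each run with budget $l$. The function name and the toy numbers are assumptions made for illustration; this is not the crawler code used in the experiments.

```python
import statistics


def asymptotic_variance_estimate(estimates, l):
    """Scaled sample variance of q per-sampler estimates, each built with budget l.

    This equals the sample variance of sqrt(l) * estimate, i.e. l / (q - 1)
    times the sum of squared deviations from the mean of the q estimates.
    """
    if len(estimates) < 2:
        raise ValueError("need at least two samplers")
    return l * statistics.variance(estimates)


# Toy usage with hypothetical numbers: q = 5 samplers, budget l = 400 each.
per_sampler = [10.2, 9.7, 10.5, 9.9, 10.1]
sigma2_hat = asymptotic_variance_estimate(per_sampler, 400)
```

Because the deviation of each estimate shrinks roughly like $1/\sqrt{l}$, multiplying by $l$ keeps the result comparable across budgets.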

2) Estimating the upper bound of the optimal fraction $c^*(M)$:
Given any sub-budget $m < M$, we implement the upper-bound
estimation function $\hat{c}^*(m)$ as follows. We use the budget $m$
to collect $B$ sample sets of equal size $m'$ for each statistic; with
these $m$ samples we can estimate the optimal fraction $c^*(m')$
for a total budget of $m'$. We denote the $b$-th sample set of the
$k$-th statistic as $S_k^{(b)}$. We can try two-stage strategies with
different fractions $c \in (0, 1)$ on the $b$-th group of sample sets
$\{S_k^{(b)} : k = 1, \dots, K\}$. Specifically, as in a normal two-stage
strategy with fixed $c$, we take the pilot samples from each set
$S_k^{(b)}$, use them to infer the best statistic $\hat{k}^*$, and then use
the remaining $(1 - c)m'$ samples of the inferred best statistic to
generate a realization of the estimator $\hat{f}(a(c), m', w^*(c))$.
Finally, we calculate the sample variance of the $B$ realizations
obtained from the $B$ groups of sample sets. Based on (14), we
choose the fraction $c$ that minimizes the sample variance as an
estimate of $c^*(m')$, which serves as an upper bound for the
optimal fraction $c^*(M)$.

When the number of realizations $B$ increases, the estimate
$\hat{c}^*(m')$ of $c^*(m')$ becomes more accurate. However, the budget
$m'$ decreases under a fixed budget $m$, and therefore using
$\hat{c}^*(m')$ as an upper bound for the optimal fraction $c^*(M)$
can be loose. As a result, we recommend choosing a moderate
value for the parameter $B$.
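
The procedure above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the pilot budget is split evenly over the $K$ statistics, the best statistic is inferred as the one with the smallest sample variance of its pilot values, and each realization is the plain mean of the regular-stage values (the mixture weights are ignored). All function names are hypothetical.

```python
import statistics


def two_stage_realization(group, c):
    """One two-stage run on a group of K pre-collected sample lists of length m'."""
    K = len(group)
    m_prime = len(group[0])
    n_pilot = max(2, int(c * m_prime / K))      # pilot values per statistic (assumed even split)
    n_regular = max(1, int((1 - c) * m_prime))  # regular-stage values
    # Pilot stage: infer the best statistic (assumed rule: smallest sample variance).
    pilot_var = [statistics.variance(s[:n_pilot]) for s in group]
    best = min(range(K), key=pilot_var.__getitem__)
    # Regular stage: use the following values of the inferred best statistic.
    regular = group[best][n_pilot:n_pilot + n_regular]
    return statistics.fmean(regular)


def estimate_fraction_upper_bound(groups, candidates):
    """Return the candidate c whose B realizations have the smallest sample variance."""
    best_c, best_var = None, float("inf")
    for c in candidates:
        realizations = [two_stage_realization(g, c) for g in groups]
        v = statistics.variance(realizations)   # needs B >= 2 groups
        if v < best_var:
            best_c, best_var = c, v
    return best_c
```

Here `groups` holds the $B$ groups of sample sets and `candidates` is a grid of fractions in $(0, 1)$; a finer grid gives a finer estimate at the cost of more computation, but no additional sampling budget.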


C. Measurement Setup 

The publicly available information for every Douban user
includes the user-id, location, lists of followers, the users he/she
follows and the interest groups he/she has joined. We consider
two measurement targets, i.e., the average number of followers
of users and the average number of interest groups of users. To
measure these targets, we develop crawlers that sample via the
four statistics ($K = 4$), i.e., the RWuR and FS methods on the
followship and membership graphs. We ignore the users who
do not have any followship or membership, as these isolated
users cannot be visited via crawling.

Fig. 1. NRMSE of the adaptive two-stage strategy (ATS), Average Statistics (AVG), Random Statistics (RND), and individual statistics including the RWuR
and FS methods on the followship graph (RW-f and FS-f) and on the membership graph (RW-m and FS-m), when we vary the total sampling budget M. (a)
Measuring the average number of followers of users; (b) measuring the average number of interest groups of users.

We set the total sampling budget to $M = 4 \times 10^4$, which
represents about 0.05% of the total number of Douban users.^1
For the statistics based on the FS method, we set the number
of random walkers to $s = 50$ in an FS sampler. For the statistics
based on the RWuR method, we moderately set the parameter
that controls the probabilities of random walk and jump to
$\alpha = 0.1$. We also account for the cost of uniformly choosing a
start node and jumping to an arbitrary node in the FS or RWuR
method. This cost is about 14 units of budget in the Douban
network, i.e., on average 14 randomly generated user-ids must be
queried to obtain a valid one in the user-id space.
In the two-stage framework, we set the number of realizations
to $B = 10$ and the budget spending step size to
$\Delta m = 2\% M = 800$ for Algorithm 1. To estimate the asymptotic
variances, we use $q = 5$ samplers.
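
To make the role of the jump parameter and of the jump cost concrete, the sketch below shows one step of a random walk with uniform restarts. It assumes the commonly used rule that a walker at a node of degree $d$ jumps to a uniformly chosen node with probability $\alpha / (d + \alpha)$; this rule, the adjacency-dictionary graph, the unit edge cost and the function names are assumptions for illustration, while the 14-unit jump cost is the Douban figure quoted above.

```python
import random

JUMP_COST = 14  # budget units per uniform jump (Douban figure from the text)
STEP_COST = 1   # assumed cost of following one edge


def jump_probability(degree, alpha):
    """Probability of restarting at a uniform node (assumed RWuR rule)."""
    return alpha / (degree + alpha)


def rwur_step(graph, node, alpha, rng):
    """Return (next_node, budget_spent) for one step of the walk."""
    neighbors = graph[node]
    if rng.random() < jump_probability(len(neighbors), alpha):
        # Uniform restart: expensive, because valid user-ids are sparse.
        return rng.choice(sorted(graph)), JUMP_COST
    return rng.choice(neighbors), STEP_COST


# Toy usage on a hypothetical three-node path graph.
toy_graph = {0: [1], 1: [0, 2], 2: [1]}
next_node, cost = rwur_step(toy_graph, 0, 0.1, random.Random(0))
```

Under this rule, low-degree nodes trigger restarts more often, so on a sparse graph a larger share of the budget goes to jumps.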

We also implement the benchmark strategies, i.e., Random
Statistics and Average Statistics, and the four single-statistic
strategies for comparison. To measure the estimation
accuracy of the different sampling strategies, we
use Normalized Root Mean Square Error (NRMSE) |l2^@l, 

$\sqrt{\mathrm{E}[(\hat{f} - f)^2]} / f$, where $f$ is the true value of the measurement
target and $\hat{f}$ is the estimated one. Because the “ground truth”
$f$ is not published by Douban, we calculate the NRMSE by
taking as $f$ the grand average of the $\hat{f}$ values over all samples
collected via all full-length crawlers and statistics. All experiment
results presented in the following are averages over 25
independent simulations, and our crawls were performed from
Nov. 5 to Nov. 11, 2013.
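
For concreteness, the NRMSE over a set of independent runs can be computed as in the following minimal sketch. The function name and the example numbers are hypothetical, and the expectation is replaced by the empirical mean over the runs.

```python
import math


def nrmse(estimates, true_value):
    """Normalized root mean square error of repeated estimates of one target."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_value


# Two hypothetical runs that miss a true value of 10 by one unit each.
example = nrmse([9.0, 11.0], 10.0)
```

In the setup described above, `true_value` would be the grand average over all collected samples, since the ground truth is not available.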

D. Evaluation Results 

1) Performance of the Adaptive Two-Stage Strategies (ATS):
Figure 1 shows that the efficiencies of different statistics may
vary for different measurement targets. For example, when we
measure the average number of followers of users, the RWuR
method on the membership graph (RW-m) leads to higher
estimation accuracy than the FS method on the followship
graph (FS-f), as shown in subfigure 1(a); however, when the
target is the average number of interest groups of users, the
conclusion is reversed, as shown in subfigure 1(b). Thus, the
efficiencies of the statistics vary as the measurement target
changes, and choosing a bad statistic, e.g., the RW-f strategy
for estimating the average number of followers, may lead to an
inaccurate estimate. Figure 1 also shows that, without knowing
the efficiencies of the individual statistics, our adaptive
two-stage strategy (ATS) always outperforms both benchmark
strategies (AVG and RND) regardless of the measurement
target. Furthermore, our strategy (ATS) is only slightly inferior
to the true best statistic (RW-m for estimating the average
number of users' followers, or FS-f for estimating the average
number of users' groups), which could be used directly only if
the asymptotic variances of all statistics were known. Finally,
the two subfigures of Figure 1 demonstrate that our framework
adapts well to different measurement targets.

^1 By Nov. 15, 2013, the Douban service provider declared that it had about 79.2
million users.

Fig. 2. The ratio of the budget needed by the two-stage strategy to that needed
by the others, including Average Statistics (AVG), Random Statistics (RND), and
individual statistics including the RWuR and FS methods on the followship
graph (RW-f and FS-f) and on the membership graph (RW-m and FS-m), so
as to attain the same NRMSE. (a) Measuring the average number of followers
of users; (b) measuring the average number of interest groups of users.

Figure 2 shows the budget saving of our two-stage strategy
(ATS) compared with the benchmarks (AVG and RND) and
the single-statistic strategies, provided that they can reach the
given NRMSE target. For example, when measuring the average
number of followers of users in subfigure 2(a), ATS saves
about 49% of the budget compared with the AVG strategy to
obtain NRMSE = 0.015. In subfigure 2(b), where the target
is the average number of groups of users, ATS saves about
75% of the budget compared with the RND strategy to obtain
NRMSE = 0.025. In general, we observe that our ATS
strategy requires only 18% to 57% of the budget needed by
the benchmark strategies to achieve the same NRMSE. We
also observe that, when measuring the average number of
followers (resp. groups) of users, the best statistic RW-m
(resp. FS-f) uses the smallest amount of budget, which is
consistent with the observations from Figure 1 and the result
of Theorem 2.

Fig. 3. With total sampling budget $M = 4 \times 10^4$, NRMSE of the two-stage
framework with the estimated optimal weights (TS-aw), the two-stage
framework with the same weight for each sample point, i.e.,
$\hat{f}(a, M, a)$ (TS-aa), and Average Statistics (AVG), when we vary the fraction
of pilot budget $c$. (a) Measuring the average number of followers of users;
(b) measuring the average number of interest groups of users.


2) Benefit of optimal allocation decision and weights: The
two-stage framework tries to improve estimation efficiency
by choosing the budget allocation decision and setting estimated
optimal weights for the mixture estimator. Figure 3 compares
the NRMSE of our two-stage strategy when the weights are set
to be equal (TS-aa) or optimally adjusted (TS-aw) with that of
the AVG benchmark strategy, as the fraction $c$ of the pilot
budget varies along the x-axis. We observe that the two-stage
strategy with optimal weights always outperforms that with
equal weights, which in turn outperforms the AVG benchmark
strategy. Notice that under equal weights, $c = 0$ and $c = 1$
correspond to the RND and AVG strategies, respectively,
which have the same performance, as shown in our theoretical
analysis. In general, when $c$ increases from 0 to 1, the benefit of
the two-stage strategy first increases and then decreases. This is
the combined result of two competing factors: 1) increasing the
pilot budget helps select the more efficient statistic for the
regular sampling stage, and 2) at the same time, more budget is
allocated to the inefficient statistics at the pilot sampling stage.
We also observe that the benefit of using optimal weights is
larger when the pilot fraction is larger. The reason is that with a
larger pilot budget, more samples are spent on the inefficient
statistics, and therefore optimal weights are more needed to
discount those statistics.
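
The effect of the weights can be illustrated numerically. The sketch below assumes that the asymptotic variance of a mixture with budget shares $a_k$ and weights $w_k$ takes the usual form $\sum_k w_k^2 \sigma_k^2 / a_k$, and that the optimal weights are the inverse-variance weights $w_k \propto a_k / \sigma_k^2$. Both formulas are stated here as assumptions, and the variances are hypothetical.

```python
def mixture_variance(a, w, sigma2):
    """Assumed asymptotic variance: sum of w_k^2 * sigma_k^2 / a_k (all a_k > 0)."""
    return sum(wk ** 2 * s2 / ak for ak, wk, s2 in zip(a, w, sigma2))


def inverse_variance_weights(a, sigma2):
    """Weights proportional to a_k / sigma_k^2, normalized to sum to one."""
    raw = [ak / s2 for ak, s2 in zip(a, sigma2)]
    total = sum(raw)
    return [r / total for r in raw]


sigma2 = [1.0, 4.0]   # hypothetical asymptotic variances of two statistics
a = [0.5, 0.5]        # even budget split, as in the AVG strategy
equal = mixture_variance(a, a, sigma2)  # weights equal to the budget shares
best = mixture_variance(a, inverse_variance_weights(a, sigma2), sigma2)
```

With these numbers the equal-weight mixture has variance 2.5, the average of the two variances, while the re-weighted mixture reaches 1.6, which illustrates why the gap widens as more budget is spent on inefficient statistics.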


3) Effectiveness of the adaptive Algorithm 1: We implemented
Algorithm 1 for estimating the optimal pilot fraction.
Figure 4 shows that the estimated optimal pilot fraction
$\hat{c}^*(t\Delta m)$ for $c^*(M)$ (solid line) has a decreasing trend as the
number of iterations $t$ increases. This is consistent with our result
that the optimal pilot fraction $c^*(M)$ decreases as the budget
$M$ grows. The consumed fraction of the pilot budget $c$ (dashed
line) increases linearly (at a rate of $\Delta m$) with the number of
iterations. When the consumed pilot fraction $c$ exceeds
the estimated upper bound of the optimal fraction, the iteration
in Algorithm 1 stops. Subfigures 4(a) and 4(b) show that
when measuring the average number of users' followers (resp.
groups), the estimated optimal pilot fraction of 40% (resp. 22%)
is reasonably close to the real value of 32% (resp. 16%). These
results show that Algorithm 1 is effective for setting a
near-optimal pilot fraction in practical two-stage sampling.

Fig. 4. With total sampling budget $M = 4 \times 10^4$, the estimated upper
bound of the optimal fraction $\hat{c}^*(cM)$ and the spent fraction of budget $c$
as the number of iterations of Algorithm 1 increases. (a) Measuring the
average number of followers of users; (b) measuring the average number of
interest groups of users.
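
The stopping rule described above can be sketched as a short loop. This is only an outline of the control flow, not Algorithm 1 itself: the upper-bound estimator is passed in as a caller-supplied function, and its name and signature are hypothetical.

```python
def adaptive_pilot_budget(total_budget, step, estimate_upper_bound):
    """Spend the pilot budget in increments of `step` until the consumed
    fraction exceeds the current estimate of the optimal pilot fraction.

    estimate_upper_bound(spent) is a hypothetical callback that returns the
    estimated upper bound of the optimal fraction given the budget spent so far.
    """
    spent = 0
    while spent < total_budget:
        spent = min(spent + step, total_budget)
        if spent / total_budget > estimate_upper_bound(spent):
            break  # pilot stage ends; the rest goes to the inferred best statistic
    return spent


# Toy usage with the budget and step size of the experiments and a constant,
# made-up upper bound of 11%.
pilot = adaptive_pilot_budget(40000, 800, lambda spent: 0.11)
```

With a constant bound of 11% the loop stops after six increments, when the consumed fraction first reaches 12%.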

4) Observations of different statistics: Finally, we provide
some insights into the different statistics. Subfigure 1(a)
indicates that the RWuR and FS methods perform better on the
membership graph than on the followship graph when we
measure the average number of followers of users. However,
when the target is the average number of groups of users,
the conclusion is reversed, as shown in subfigure 1(b). The
reason may be that the followship (resp. membership) graph
has a strong cluster feature IS] that makes the samples highly 
correlated on the number of the users’ followers (resp. groups). 
This strong correlation leads to a poor estimation accuracy. 
We also observe that, on the followship graph, the FS method
achieves higher efficiency than the RWuR method, while on the
membership graph the RWuR method has a smaller estimation
error than the FS method. One reason may be that the RWuR
sampler frequently chooses an arbitrary node as a restart point
on a less connected graph (e.g., the followship graph), which
costs a large budget and decreases the estimation accuracy. On
the other hand, the RWuR sampler is close to a single random
walker on a well-connected graph (e.g., the membership graph).
Compared with the FS method with multiple random walkers,
it saves the cost of obtaining multiple uniform start nodes and
of converging to the walkers' steady state.


VI. Related Work 

Graph Sampling Techniques. As OSN service providers
rarely make the frame information of entire networks publicly
visible, the most widely used graph sampling techniques in
OSNs are crawling methods. Early graph crawling methods
are based on Breadth-First Search (BFS), Depth-First Search
(DFS) and Snowball Sampling (SBS) Q. In particular, BFS 
has been frequently used to explore large networks, such as 
Youtube and Facebook ||6l. However, these methods introduce 
a large bias towards high-degree nodes, which is difficult to
correct in general graphs.

Recently, the most popular graph crawling approach is random
walk-based sampling, including the simple random walk with
re-weighting (RWRW) HI m and the Metropolis-Hastings random
walk (MHRW) IH. RWRW is considered as a special case of 
Respondent-Driven Sampling (RDS) fD if only one neighbor 
is chosen in each iteration and revisiting nodes is allowed. 
It is also biased to sample high degree nodes, but the bias 
can be corrected by the Hansen-Hurwitz estimator shown in 
Il6]l2l. RWRW was not only used to sample OSNs ITIIEI, but 
also P2P networks and Web ifTSlUhl . MHRW is based on the 
Metropolis-Hastings (MH) algorithm and provides unbiased 
samples directly Ellll. Some studies Gill have shown that 
RWRW estimates are more accurate than MHRW estimates. 
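
The re-weighting step mentioned above can be illustrated as follows. A simple random walk on a connected undirected graph visits nodes with probability proportional to their degree, so weighting each visited node by the inverse of its degree removes this bias. The sketch below is a minimal example with hypothetical inputs rather than crawler output.

```python
def rwrw_mean(values, degrees):
    """Re-weighted mean of node values sampled by a simple random walk.

    values[i] is the quantity of interest at the i-th visited node and
    degrees[i] is that node's degree (must be positive).
    """
    weights = [1.0 / d for d in degrees]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Estimating the average degree from three visited nodes with degrees 1, 2, 2.
avg_degree = rwrw_mean([1, 2, 2], [1, 2, 2])
```

In this toy case the plain sample mean would be about 1.67, while the re-weighted estimate is 1.5, reflecting the correction for over-sampled high-degree nodes.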

Improvement of sampling efficiency. Researchers have
proposed some methods to improve the sampling efficiency
against random walk-based sampling, including the FS ||3l 
and RWuR Q methods which we apply as showcases in this 
work. Besides, Kurant et al. ifTTll presented a weighted random 
walk method to perform stratified sampling with a priori 
estimate of network information. Lee et al. m proposed a 
non-backtracking random walk, which forbids the sampler from
backtracking to the previously visited node, and they theoretically
guaranteed that the technique achieves higher efficiency than a
simple random walk. Our work concentrates on how to combine
the existing statistics (sampling methods) efficiently and thus
is complementary to their approaches.

It is worth mentioning that, Gjoka et al. m designed a 
multi-graph sampling technique for the social networks which 
have multiple relation graphs. Their technique improves the
convergence rate of the sampler by walking along a union
graph of all relations, but it does not distinguish the efficiencies
of walking on different relation graphs. In this paper,
we propose the two-stage framework to select the inferred
most efficient one from multiple graphs to further improve
sampling efficiency.

VII. Conclusions

In this paper, we consider the problem of using multiple
statistics to efficiently sample online social networks. Given a
fixed sampling budget, we design budget allocation decisions
and combine them to construct an optimal estimator. In
particular, we formulate a mixture sampling problem that
constructs the optimal mixture estimator, and derive the optimal
weights and a condition for ranking budget allocation decisions
for the optimal estimator. Because the asymptotic variances of
the individual statistics are unknown in practice, we propose an
adaptive two-stage framework, which spends a partial budget
to test all the different statistics in the pilot sampling stage and
allocates the remaining budget to the inferred best statistic in
the regular sampling stage. To optimally set the sub-budget
for the pilot sampling stage, we design an adaptive algorithm
that dynamically decides an upper bound of the optimal pilot
budget and tests whether the pilot sampling should end. We
implement the adaptive two-stage framework and evaluate
its performance in the Douban network. We demonstrate, in
theory and experiment, that our two-stage framework achieves
higher sampling efficiency than two benchmark strategies.
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Appendix 


Proof of Theorem From Equation Q, we have 

lim f{a,M,w)= lim 'V' Wk ■ fk{akM) 
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implying that the mixture estimator f{a,M,w) is asymptot¬ 
ically unbiased for / if and only if J^keic '^k = i-e-^ 

Equation (j^ concludes. Then from Equation (^, observe that 
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Based on Cauchy-Schwarz inequality, it satisfies 
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Proof of Theorem As each estimated asymptotic variance 
(T^(-) is an asymptotically unbiased for a\ [k = - ■ ,K), 

observe that 

lim P{k* = k*) 

M-foo 

= lim < cr| Vj = 1, • • • , AT) = 1 

holds up if c(M) S Thus, we have, as M —>■ 00 , 

a,(c(M)) ^ ^ + (1 - c(M)) . = aUM) 

for Vfc = 1, • • • , AT. Consequently, it satisfies k* -11^ 
k*,a{c{M)) -11^ a*{M) and w{c{M)) -11^ w*{a*{M)) as 
M —>■ + 00 . 

Proof of Corollary Erom Theorem [T] for any 

c(M) G w(M~^), we have <;{a*{M),w*{a*(M))) < 

<; {a*{M)^a*{M)) as M ^ 00 , where 
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where the equality holds up if and only if Wk = 
Thus given an allocation decision a, for any 
weight vector w G Wa, 
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holds up, i.e., w* solve the optimization problem in Equation 

Q. 
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Proof of Theorem 12 If the allocation decisions a and 

a' satisfies Yhk=i “(fc) > ELi “(fc) (* = lr • '.K), 
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holds up. Based on Theorem [T] we have 
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In particular, for any a, the allocation a* satisfies 

^{a,w*{a)) > <i{a*,w*{a*)) = ah- 


Proof of TheoremJ^ Because E[i^(a(c(M)), i(>*(c(il^)] = 
lim M ■ Var{f{aic),M,w*(c))) from Equation (5i, the 

—^-|-oo 

fraction c*{M) minimizes E[ij(a(c(M)), iu*(c(M)))]. 
When the convergence rate of estimated asymptotic 
variance fffc(m) for cr^ is (k = I,-- - ,K) and 

c{M) G u){M~h, it satisfies E[c:(a(c(M)),tu*(c(M)))] —>■ 
E[<;(a*(M), tu*(a*(M)))] as M —> -foo with the conver¬ 
gence rate 0((c(M)M)“'') from Theorem 0 and Bounded 
Convergence Theorem ll20l . Eurther, as c*(M) minimizes 
E[<r(a(c(M)), i()*(c(M)))], it satisfies the first-order condi¬ 
tion lim tigE[<?(a(c(M)),m*(c(M)))] , ^ 
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therefore " Ic=c-(m)) = ©(-= 
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Proof of Theorem When c = 1, afc(l) = 1/K 
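($k = 1, \dots, K$). Then we have $\varsigma(a(1), a(1)) = \frac{1}{K} \sum_{k=1}^{K} \sigma_k^2$.
When $c = 0$, the inferred most efficient statistic is chosen
uniformly at random, i.e., $P(\hat{k}^*(cM) = k) = 1/K$
($k = 1, 2, \dots, K$). Then $P(a_k(0) = 1) = 1/K$ and

$$\mathrm{E}\big(\varsigma(a(0), w^*(0))\big) = \sum_{k=1}^{K} P(a_k(0) = 1) \cdot \sigma_k^2 = \frac{1}{K} \sum_{k=1}^{K} \sigma_k^2.$$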
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